@tomerqodo tomerqodo commented Dec 4, 2025

User description

Benchmark PR elastic#138126

Type: Clean (correct implementation)

Original PR Title: Fix stats performance
Original PR Description: This fixes the N^2 performance problem described in elastic#97222. In addition to restoring the previous partial fix (elastic#130857), it does the following:

  1. IndicesQueryCache::getStats now accepts a Supplier, so that IndicesQueryCache::getSharedRamSizeForAllShards is called only when it is actually needed. This fixes an N^2 performance problem introduced by elastic/elasticsearch#130857 (Improving statsByShard performance when the number of shards is very large). Before that change, a call to TransportIndicesStatsAction that did not request query cache stats never entered the N^2 loop; the loop ran only when query cache stats were requested. After that change, every call paid the N^2 cost. This is a serious problem for clusters with a large number of shards, because the action is called very frequently (including every 30 seconds by a background task).
  2. It fixes the N^2 performance in TransportIndicesStatsAction by sharing state across all shardOperation calls on a single node, using the new NodeContext feature from elastic/elasticsearch#138057 (Adding NodeContext to TransportBroadcastByNodeAction).
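A minimal sketch of the first point, under stated assumptions: the class `LazySharedRamSketch`, the method `statsFor`, and the constant 1024 are all hypothetical and are not the Elasticsearch API. It only illustrates why passing a supplier instead of a precomputed value lets callers that do not request query cache stats skip the expensive scan entirely.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongSupplier;

public class LazySharedRamSketch {
    // Counts how often the (hypothetical) expensive all-shards scan runs.
    static final AtomicInteger expensiveCalls = new AtomicInteger();

    // Stand-in for an expensive computation that walks every shard on the node.
    static long expensiveSharedRam() {
        expensiveCalls.incrementAndGet();
        return 1024L;
    }

    // Hypothetical stats method: the supplier is consulted only when
    // query cache stats were actually requested.
    static long statsFor(boolean queryCacheRequested, LongSupplier sharedRam) {
        return queryCacheRequested ? sharedRam.getAsLong() : 0L;
    }

    public static void main(String[] args) {
        long skipped = statsFor(false, LazySharedRamSketch::expensiveSharedRam);
        System.out.println(skipped + " calls=" + expensiveCalls.get());
        long computed = statsFor(true, LazySharedRamSketch::expensiveSharedRam);
        System.out.println(computed + " calls=" + expensiveCalls.get());
    }
}
```

With an eagerly computed argument, the scan would run on both calls; with the supplier it runs only on the second.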

Closes elastic#97222
Original PR URL: elastic#138126


PR Type

Bug fix, Enhancement


Description

  • Fix N^2 performance problem in stats APIs by caching shared RAM calculations

  • Introduce CacheTotals record and refactor shared RAM distribution logic

  • Update IndicesQueryCache.getStats() to accept precomputed shared RAM supplier

  • Modify TransportIndicesStatsAction to use NodeContext for state sharing across shards

  • Update TransportClusterStatsAction to cache shared RAM calculations per node
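As a rough illustration of what "distributing" precomputed totals to a shard could look like, here is a hedged sketch. The fields of this `CacheTotals` record, the method `sharedRamForShard`, and the proportional-split rule are assumptions made for the example; this page does not show the PR's actual record definition or formula.

```java
public class SharedRamDistributionSketch {
    // Hypothetical totals gathered in a single pass over all shards:
    // combined cached-item count, number of shards, and RAM that cannot
    // be attributed to any single shard.
    record CacheTotals(long totalItemsInCache, int shardCount, long sharedRamBytes) {}

    // Assumed rule: split shared RAM in proportion to each shard's cached
    // items, falling back to an even split when nothing is cached.
    static long sharedRamForShard(long itemsInCacheForShard, CacheTotals totals) {
        if (totals.sharedRamBytes() == 0L || totals.shardCount() == 0) {
            return 0L;
        }
        if (totals.totalItemsInCache() == 0L) {
            return Math.round((double) totals.sharedRamBytes() / totals.shardCount());
        }
        return Math.round((double) totals.sharedRamBytes() * itemsInCacheForShard / totals.totalItemsInCache());
    }

    public static void main(String[] args) {
        CacheTotals totals = new CacheTotals(100L, 4, 1000L);
        System.out.println(sharedRamForShard(25L, totals));
    }
}
```

Because the totals are an input rather than something recomputed per call, each per-shard lookup is constant-time once the totals exist.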


Diagram Walkthrough

flowchart LR
  A["Stats API Requests"] -->|"Previously: O(N²) loop"| B["IndicesQueryCache.getStats"]
  A -->|"Now: Cached once per node"| C["CachedSupplier wrapper"]
  C -->|"Computes totals once"| D["getCacheTotalsForAllShards"]
  D -->|"Distributes to shards"| E["getSharedRamSizeForShard"]
  E -->|"Returns precomputed value"| B
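The "cached once per node" path in the diagram can be sketched as a memoizing supplier shared by every shard-level operation. The `memoize` helper below is a hypothetical stand-in written for this example, not Elasticsearch's CachedSupplier, and `scansFor` merely simulates N shard operations reading one shared context.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class NodeContextSketch {
    // Minimal memoizing wrapper (hypothetical): the delegate runs at most once.
    static <T> Supplier<T> memoize(Supplier<T> delegate) {
        return new Supplier<T>() {
            private T value;
            private boolean computed;

            @Override
            public synchronized T get() {
                if (!computed) {
                    value = delegate.get();
                    computed = true;
                }
                return value;
            }
        };
    }

    // Simulates `shards` shard-level operations sharing one context and
    // returns how many full scans over all shards were performed.
    static int scansFor(int shards) {
        AtomicInteger scans = new AtomicInteger();
        Supplier<Long> context = memoize(() -> {
            scans.incrementAndGet(); // stands in for one O(N) pass over all shards
            return 4096L;
        });
        for (int shard = 0; shard < shards; shard++) {
            context.get();
        }
        return scans.get();
    }

    public static void main(String[] args) {
        System.out.println(scansFor(1000));
    }
}
```

Without the shared context, each of the N operations would trigger its own O(N) pass, which is where the N^2 cost comes from; with it, the pass happens once, and not at all if no operation asks for the value.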

File Walkthrough

Relevant files
Bug fix, enhancement
3 files
IndicesQueryCache.java
Refactor shared RAM calculation with CacheTotals record   
+81/-33 
TransportIndicesStatsAction.java
Use NodeContext to cache query cache totals across shards
+19/-3   
TransportClusterStatsAction.java
Cache shared RAM calculations using CachedSupplier             
+16/-3   
Enhancement
2 files
CommonStats.java
Accept precomputed shared RAM supplier in getShardLevelStats
+8/-2     
IndicesService.java
Pass precomputed shared RAM to indexShardStats method       
+16/-8   
Tests
5 files
IndicesQueryCacheTests.java
Update tests to use new getStats supplier parameter           
+129/-82
IndicesServiceCloseTests.java
Update cache stats calls with precomputed shared RAM         
+7/-7     
IndicesServiceTests.java
Update mocks to handle new indexShardStats signature         
+10/-4   
VersionStatsTests.java
Update getShardLevelStats call with shared RAM supplier   
+1/-1     
IndexShardTests.java
Update getShardLevelStats call with shared RAM supplier   
+1/-1     
Documentation
1 file
138126.yaml
Add changelog entry for performance improvement                   
+6/-0     

@qodo-code-review

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified
No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed


Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed


Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status: Passed


Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
No audit logs: The new performance-related logic (e.g., caching suppliers and node context) introduces
critical stats behavior changes without adding any audit logging for accesses or
computations that could be considered sensitive operational actions.

Referred Code
@Override
protected Supplier<IndicesQueryCache.CacheTotals> createNodeContext() {
    return CachedSupplier.wrap(() -> IndicesQueryCache.getCacheTotalsForAllShards(indicesService));
}

@Override
protected void shardOperation(
    IndicesStatsRequest request,
    ShardRouting shardRouting,
    Task task,
    Supplier<IndicesQueryCache.CacheTotals> context,
    ActionListener<ShardStats> listener
) {
    ActionListener.completeWith(listener, () -> {
        assert task instanceof CancellableTask;
        IndexService indexService = indicesService.indexServiceSafe(shardRouting.shardId().getIndex());
        IndexShard indexShard = indexService.getShard(shardRouting.shardId().id());
        CommonStats commonStats = CommonStats.getShardLevelStats(
            indicesService.getIndicesQueryCache(),
            indexShard,
            request.flags(),


 ... (clipped 6 lines)


Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Null cache handling: New helper methods compute shared RAM and stats across shards but rely on external
services and suppliers without explicit error handling or null checks beyond basic
ternaries, which may require verification in broader context.

Referred Code
public static Map<ShardId, Long> getSharedRamSizeForAllShards(IndicesService indicesService) {
    Map<ShardId, Long> shardIdToSharedRam = new HashMap<>();
    IndicesQueryCache.CacheTotals cacheTotals = IndicesQueryCache.getCacheTotalsForAllShards(indicesService);
    for (IndexService indexService : indicesService) {
        for (IndexShard indexShard : indexService) {
            final var queryCache = indicesService.getIndicesQueryCache();
            long sharedRam = (queryCache == null) ? 0L : queryCache.getSharedRamSizeForShard(indexShard.shardId(), cacheTotals);
            // as a size optimization, only store non-zero values in the map
            if (sharedRam > 0L) {
                shardIdToSharedRam.put(indexShard.shardId(), sharedRam);
            }
        }
    }
    return Collections.unmodifiableMap(shardIdToSharedRam);
}

public long getCacheSizeForShard(ShardId shardId) {
    Stats stats = shardStats.get(shardId);
    return stats != null ? stats.cacheSize : 0L;
}



 ... (clipped 47 lines)


Compliance status legend 🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-code-review

PR Code Suggestions ✨

Explore these optional code suggestions:

Category    Suggestion    Impact
General
Hoist query cache retrieval out of loop

Hoist the indicesService.getIndicesQueryCache() call out of the nested loop and
add an early null check to improve performance.

server/src/main/java/org/elasticsearch/indices/IndicesQueryCache.java [83-97]

 public static Map<ShardId, Long> getSharedRamSizeForAllShards(IndicesService indicesService) {
+    final var queryCache = indicesService.getIndicesQueryCache();
+    if (queryCache == null) {
+        return Collections.emptyMap();
+    }
     Map<ShardId, Long> shardIdToSharedRam = new HashMap<>();
     IndicesQueryCache.CacheTotals cacheTotals = IndicesQueryCache.getCacheTotalsForAllShards(indicesService);
     for (IndexService indexService : indicesService) {
         for (IndexShard indexShard : indexService) {
-            final var queryCache = indicesService.getIndicesQueryCache();
-            long sharedRam = (queryCache == null) ? 0L : queryCache.getSharedRamSizeForShard(indexShard.shardId(), cacheTotals);
+            long sharedRam = queryCache.getSharedRamSizeForShard(indexShard.shardId(), cacheTotals);
             // as a size optimization, only store non-zero values in the map
             if (sharedRam > 0L) {
                 shardIdToSharedRam.put(indexShard.shardId(), sharedRam);
             }
         }
     }
     return Collections.unmodifiableMap(shardIdToSharedRam);
 }
Suggestion importance[1-10]: 5


Why: The suggestion correctly identifies a performance improvement by hoisting the getIndicesQueryCache() call out of a nested loop, which is a valid optimization.

Low
Avoid redundant query cache retrieval

Remove the redundant indicesService.getIndicesQueryCache() call inside the
lambda by capturing and reusing the queryCache from the outer scope.

server/src/main/java/org/elasticsearch/action/admin/indices/stats/TransportIndicesStatsAction.java [135-143]

+final IndicesQueryCache queryCache = indicesService.getIndicesQueryCache();
 CommonStats commonStats = CommonStats.getShardLevelStats(
-    indicesService.getIndicesQueryCache(),
+    queryCache,
     indexShard,
     request.flags(),
-    () -> {
-        final IndicesQueryCache queryCache = indicesService.getIndicesQueryCache();
-        return (queryCache == null) ? 0L : queryCache.getSharedRamSizeForShard(indexShard.shardId(), context.get());
-    }
+    () -> (queryCache == null) ? 0L : queryCache.getSharedRamSizeForShard(indexShard.shardId(), context.get())
 );
Suggestion importance[1-10]: 4


Why: The suggestion correctly points out a redundant call to indicesService.getIndicesQueryCache() inside a lambda and proposes a valid simplification for better code clarity.

Low